umpireEDA.ipynb

Description

This notebook explores the data that was collected and uploaded to AWS in other notebooks. It uses the eda module built into the baseball package. Some of the visualizations produced here are included in the final report; the rest are additional explorations.

Part of Baseball Umpire Analysis Package for UMSI SIADS-591 & 592 Milestone I by Anthony Giove (agiove@umich.edu), Avinash Reddy (avimads@umich.edu), and Ryan Maley (rjmaley@umich.edu).

In [1]:
# importing the baseball package and our aws credentials. when baseball is imported,
# the imports necessary for any module within baseball are passed into this workspace.
from baseball import *
import aws
requests imported
bs4 imported
psycopg2 imported
mp imported
pd imported
np imported
mpl imported
plt imported
patches imported
widgets imported
wget imported
zipfile imported
imports complete

Pull data from AWS

All of our data is housed on AWS. Using an ETL process, we created a pitches_expanded table that combines the relevant content from the many tables we built while parsing the data.

In [2]:
# sql statement needed to pull data we want to explore
baseSql = '''SELECT pitch_type, stand, p_throws, type, hit_location, bb_type,balls,strikes,pfx_x,pfx_z, inning,
description, game_type, plate_x, plate_z, game_year, daynight, on_3b,on_2b, on_1b, outs_when_up, inning_topbot,
 sz_bot, launch_speed, launch_angle, pitch_number, at_bat_number, bat_score, fld_score,
umpname, pitcher_name, height_in FROM pitches_expanded;'''
In [3]:
# using the sql module within baseball, we can directly create a dataframe while running our sql query
pitch = sql.createDF(sql.connect(aws.creds), baseSql)
pitch
Out[3]:
pitch_type stand p_throws type hit_location bb_type balls strikes pfx_x pfx_z ... sz_bot launch_speed launch_angle pitch_number at_bat_number bat_score fld_score umpname pitcher_name height_in
0 FT L R S NaN None 0.0 0.0 -1.152875 1.275000 ... 1.32 NaN NaN 1.0 27.0 0.0 3.0 David Rackley Miguel Gonzalez 73
1 FF L R B NaN None 0.0 0.0 -0.824442 1.444133 ... 1.94 NaN NaN 1.0 28.0 0.0 3.0 David Rackley Miguel Gonzalez 73
2 SL R R B NaN None 0.0 0.0 0.412750 -0.321733 ... 1.50 NaN NaN 1.0 29.0 4.0 0.0 James Hoye P.J. Walters 76
3 CU L R B NaN None 1.0 1.0 0.997250 -0.728800 ... 1.50 NaN NaN 3.0 27.0 0.0 0.0 Mark Carlson Stephen Strasburg 77
4 FF L R B NaN None 0.0 0.0 -0.622650 1.232000 ... 1.49 NaN NaN 1.0 31.0 3.0 1.0 Sam Holbrook Wily Peralta 73
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1460633 SL L R S NaN None 0.0 0.0 0.371000 0.386333 ... 1.61 NaN NaN 1.0 34.0 0.0 2.0 Fieldin Culbreth Hisashi Iwakuma 75
1460634 SL R R S NaN None 0.0 0.0 0.970808 -0.032200 ... 1.73 NaN NaN 1.0 34.0 0.0 1.0 Marvin Hudson Jhoulys Chacin 75
1460635 FF R R B NaN None 0.0 1.0 -0.476525 2.212400 ... 1.52 NaN NaN 2.0 29.0 0.0 3.0 David Rackley Miguel Gonzalez 73
1460636 FF R R S NaN None 0.0 0.0 -0.622650 1.961567 ... 1.52 NaN NaN 1.0 29.0 0.0 3.0 David Rackley Miguel Gonzalez 73
1460637 FT L R B NaN None 1.0 1.0 -1.065200 1.164633 ... 1.50 NaN NaN 3.0 28.0 0.0 0.0 Mark Carlson Stephen Strasburg 77

1460638 rows × 32 columns

Alternate Path (for Instructors Only)

Running the cell below creates the same dataframe as the AWS query, without the need for credentials.

In [ ]:
pitch = pd.read_csv('pitches_expanded.csv')

Data Cleaning, Manipulation, and Initial Visuals

Now that we have the data needed for exploration, we will run some functions from the eda module to prepare it.

In [4]:
# dataPrep bins the plate_x and plate_z columns so that comparisons and groupbys can be performed
pitch = eda.dataPrep(pitch)
pitch
Out[4]:
pitch_type stand p_throws type hit_location bb_type balls strikes pfx_x pfx_z ... at_bat_number bat_score fld_score umpname pitcher_name height_in typeExp runDiff xBin zBin
0 FT L R S NaN None 0.0 0.0 -1.152875 1.275000 ... 27.0 0.0 3.0 David Rackley Miguel Gonzalez 73 1.0 -3.0 (-1.0, -0.917] (2.083, 2.167]
1 FF L R B NaN None 0.0 0.0 -0.824442 1.444133 ... 28.0 0.0 3.0 David Rackley Miguel Gonzalez 73 0.0 -3.0 (-1.25, -1.167] (2.5, 2.583]
2 SL R R B NaN None 0.0 0.0 0.412750 -0.321733 ... 29.0 4.0 0.0 James Hoye P.J. Walters 76 0.0 4.0 (0.833, 0.917] (1.25, 1.333]
3 CU L R B NaN None 1.0 1.0 0.997250 -0.728800 ... 27.0 0.0 0.0 Mark Carlson Stephen Strasburg 77 0.0 0.0 (0.667, 0.75] (2.0, 2.083]
4 FF L R B NaN None 0.0 0.0 -0.622650 1.232000 ... 31.0 3.0 1.0 Sam Holbrook Wily Peralta 73 0.0 2.0 (-1.25, -1.167] (2.417, 2.5]
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1460633 SL L R S NaN None 0.0 0.0 0.371000 0.386333 ... 34.0 0.0 2.0 Fieldin Culbreth Hisashi Iwakuma 75 1.0 -2.0 (-0.167, -0.0833] (2.833, 2.917]
1460634 SL R R S NaN None 0.0 0.0 0.970808 -0.032200 ... 34.0 0.0 1.0 Marvin Hudson Jhoulys Chacin 75 1.0 -1.0 (-0.583, -0.5] (2.417, 2.5]
1460635 FF R R B NaN None 0.0 1.0 -0.476525 2.212400 ... 29.0 0.0 3.0 David Rackley Miguel Gonzalez 73 0.0 -3.0 (0.583, 0.667] (1.333, 1.417]
1460636 FF R R S NaN None 0.0 0.0 -0.622650 1.961567 ... 29.0 0.0 3.0 David Rackley Miguel Gonzalez 73 1.0 -3.0 (0.667, 0.75] (1.917, 2.0]
1460637 FT L R B NaN None 1.0 1.0 -1.065200 1.164633 ... 28.0 0.0 0.0 Mark Carlson Stephen Strasburg 77 0.0 0.0 (-1.417, -1.333] (3.417, 3.5]

1460638 rows × 36 columns
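
The four new columns in the output above (typeExp, runDiff, xBin, zBin) suggest what dataPrep does. The sketch below is not the eda implementation; it is a guess reconstructed from the displayed values: typeExp looks like a 1/0 flag for type == 'S', runDiff matches bat_score minus fld_score, and the bins appear to be one inch (1/12 ft) wide. The function name data_prep_sketch and the x and z ranges are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def data_prep_sketch(df, x_range=(-1.5, 1.5), z_range=(1.0, 4.0), bin_width=1 / 12):
    """Hypothetical stand-in for eda.dataPrep; ranges and bin width are assumed."""
    out = df.copy()
    # 1.0 for a called strike ('S'), 0.0 otherwise
    out["typeExp"] = (out["type"] == "S").astype(float)
    # run differential from the batting team's point of view
    out["runDiff"] = out["bat_score"] - out["fld_score"]
    # evenly spaced bin edges, one inch apart by default
    n_x = round((x_range[1] - x_range[0]) / bin_width)
    n_z = round((z_range[1] - z_range[0]) / bin_width)
    x_edges = np.linspace(x_range[0], x_range[1], n_x + 1)
    z_edges = np.linspace(z_range[0], z_range[1], n_z + 1)
    # pd.cut assigns each pitch to a right-closed interval such as (2.083, 2.167]
    out["xBin"] = pd.cut(out["plate_x"], bins=x_edges)
    out["zBin"] = pd.cut(out["plate_z"], bins=z_edges)
    return out

# toy rows, not real pitch data
toy = pd.DataFrame({
    "type": ["S", "B"],
    "bat_score": [0.0, 4.0],
    "fld_score": [3.0, 0.0],
    "plate_x": [0.04, -0.95],
    "plate_z": [2.10, 2.55],
})
prepped = data_prep_sketch(toy)
print(prepped[["typeExp", "runDiff", "xBin", "zBin"]])
```

Because pd.cut returns categorical intervals, every dataframe binned with the same edges shares the same set of bins, which is what makes later groupbys comparable across umpires.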

In [5]:
# the umpireFilter function from eda allows us to specify a cutoff (limit) for the number of strikes
# the umpire has called within the collected data. For this notebook run, the threshold is 9,500
umps = eda.umpireFilter(pitch, limit=9500) 
Chris Guccione = 9706
James Hoye = 10867
Marvin Hudson = 10515
Tom Hallion = 9833
Jerry Meals = 9948
Adrian Johnson = 9793
Joe West = 9637
Jeff Nelson = 9763
Eric Cooper = 9792
Jeff Kellogg = 9779
Jim Reynolds = 9648
Lance Barksdale = 9566
Mark Wegner = 9755
Ted Barrett = 10187
Angel Hernandez = 10530
Tim Timmons = 10205
Bill Miller = 10149
Gary Cederstrom = 9740
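
For readers without the package, the filter above can be approximated in a few lines of pandas. This is a minimal sketch, assuming the count is the number of rows with type == 'S' per umpire and that the threshold is inclusive; neither detail is confirmed by the notebook, and umpire_filter_sketch is a hypothetical name.

```python
import pandas as pd

def umpire_filter_sketch(df, limit):
    """Hypothetical stand-in for eda.umpireFilter: keep umpires with enough called strikes."""
    # count called strikes per umpire
    strike_counts = df.loc[df["type"] == "S", "umpname"].value_counts()
    # assumed: an umpire exactly at the limit is kept
    kept = strike_counts[strike_counts >= limit]
    for name, n in kept.items():
        print(f"{name} = {n}")
    return list(kept.index)

# toy rows, not real pitch data
toy_calls = pd.DataFrame({
    "umpname": ["Ump A", "Ump A", "Ump A", "Ump B"],
    "type": ["S", "S", "B", "S"],
})
kept_umps = umpire_filter_sketch(toy_calls, limit=2)
```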

Initial Visualizations

We ultimately went in a different direction, but these were two approaches we considered. The convex hull visual shows the outer boundary of the pitches umpires called strikes, while the flat strike zone shows the bounds as a rectangle drawn from the highest and lowest values on both axes.

In [6]:
# initial exploration visualization 1
eda.convexHull(pitch, umps)
In [7]:
# initial exploration visualization 2.
eda.flatStrikeZone(pitch)
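
The two boundary ideas can be illustrated without the eda module. The sketch below is an assumption about the geometry only, not the plotting code: it computes a convex hull with Andrew's monotone chain algorithm and a bounding rectangle from min/max values. The function names and the toy points are made up for illustration.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive means a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # the last point of each chain is the first point of the other
    return lower[:-1] + upper[:-1]

def bounding_box(points):
    """Flat zone: (min_x, max_x, min_z, max_z) over all points."""
    xs = [p[0] for p in points]
    zs = [p[1] for p in points]
    return min(xs), max(xs), min(zs), max(zs)

# toy (plate_x, plate_z) locations of called strikes, not real data
strikes = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)]
hull = convex_hull(strikes)
box = bounding_box(strikes)
```

Both shapes are driven entirely by the most extreme called strikes, so a single outlier call stretches the boundary. That sensitivity is one plausible reason a density-based view is more informative.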

Preparing dataframes for final visualizations and data exploration

This section details the additional cleaning done to prepare for the final visualizations and report generation. Using the eda module, we create dictionaries to house our dataframes for raw pitch data (ump_dfs, umpAll_dfs), groupby data (ump_dfsG, umpAll_dfsG), and comparison groupby data (ump_dfsC, umpAll_dfsC). Because the dictionaries share the same keys, we can iterate through several of them at once and quickly create visualizations.

In [9]:
# creating the raw data dictionaries for individual umps and all umps
ump_dfs = eda.dfUmpDict(pitch, umps)
umpAll_dfs = eda.dfUmpDict(pitch)
In [10]:
# printing the keys from both dictionaries. within each individual ump dictionary,
# the keys shown by umpAll_dfs.keys() are nested
print(ump_dfs.keys(), umpAll_dfs.keys())
dict_keys(['Chris Guccione', 'James Hoye', 'Marvin Hudson', 'Tom Hallion', 'Jerry Meals', 'Adrian Johnson', 'Joe West', 'Jeff Nelson', 'Eric Cooper', 'Jeff Kellogg', 'Jim Reynolds', 'Lance Barksdale', 'Mark Wegner', 'Ted Barrett', 'Angel Hernandez', 'Tim Timmons', 'Bill Miller', 'Gary Cederstrom']) dict_keys(['Overall', 'StandL', 'StandR', 'ThrowL', 'ThrowR', 'Begin', 'End', 'Short', 'Tall', 'Day', 'Night', 'Close', 'Open', 'Reg', 'Post'])
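
The nested layout printed above (umpire name, then scenario key) can be sketched as follows. This is an illustrative reconstruction, not the eda.dfUmpDict source: it covers only the scenarios whose definitions can be read off the column names and plot titles in this notebook (batter side, pitcher hand, innings 1-2, inning 8 and later). The helper names are hypothetical.

```python
import pandas as pd

def scenario_masks(df):
    """Boolean masks for a subset of the scenario keys; the rest are omitted on purpose."""
    return {
        "Overall": pd.Series(True, index=df.index),
        "StandL": df["stand"] == "L",
        "StandR": df["stand"] == "R",
        "ThrowL": df["p_throws"] == "L",
        "ThrowR": df["p_throws"] == "R",
        "Begin": df["inning"] <= 2,   # 1st and 2nd innings
        "End": df["inning"] >= 8,     # 8th inning and later
    }

def df_ump_dict_sketch(df, umps=None):
    """Without umps: {scenario: dataframe}. With umps: {umpire: {scenario: dataframe}}."""
    if umps is None:
        return {key: df[mask] for key, mask in scenario_masks(df).items()}
    return {ump: df_ump_dict_sketch(df[df["umpname"] == ump]) for ump in umps}

# toy rows, not real pitch data
toy_pitches = pd.DataFrame({
    "umpname": ["Ump A", "Ump A", "Ump B"],
    "stand": ["L", "R", "L"],
    "p_throws": ["R", "R", "L"],
    "inning": [1, 8, 5],
})
all_dfs = df_ump_dict_sketch(toy_pitches)
by_ump = df_ump_dict_sketch(toy_pitches, ["Ump A", "Ump B"])
```

Since the inner dictionaries reuse the same scenario keys as the all-umpire dictionary, a lookup such as by_ump[name][key] always has a matching all_dfs[key] to compare against.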
In [11]:
# grouping the raw data based on x and z bins for all umpires
umpAll_dfsG = {}
for key, val in umpAll_dfs.items():
    umpAll_dfsG[key] = eda.zoneBinGroupby(val)
In [12]:
# grouping the raw data based on x and z bins for every individual umpire
ump_dfsG = {}
for ump, v in ump_dfs.items():
    ump_dfsG[ump] = {}
    for key, val in v.items():
        ump_dfsG[ump][key] = eda.zoneBinGroupby(val)
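
A grouping step of this kind might look like the sketch below. It assumes zoneBinGroupby averages typeExp within each (xBin, zBin) cell, which is consistent with the later cells reading a typeExp column from the grouped frames but is not confirmed by the source. The name zone_bin_groupby_sketch and the coarse toy bin edges are illustrative.

```python
import pandas as pd

def zone_bin_groupby_sketch(df):
    """Hypothetical stand-in for eda.zoneBinGroupby: mean called-strike rate per cell."""
    # observed=False keeps every bin combination, including empty ones (mean = NaN),
    # so every grouped dataframe has the same number of rows
    return (
        df.groupby(["xBin", "zBin"], observed=False)["typeExp"]
          .mean()
          .reset_index()
    )

# toy rows with coarse one-foot bins, not real pitch data
toy_binned = pd.DataFrame({
    "plate_x": [-0.5, -0.5, 0.5],
    "plate_z": [1.5, 1.5, 2.5],
    "typeExp": [1.0, 0.0, 1.0],
})
toy_binned["xBin"] = pd.cut(toy_binned["plate_x"], bins=[-1, 0, 1])
toy_binned["zBin"] = pd.cut(toy_binned["plate_z"], bins=[1, 2, 3, 4])
grouped = zone_bin_groupby_sketch(toy_binned)
print(grouped)
```

Here the 2 x 3 grid yields six rows even though only two cells contain pitches.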
In [13]:
# creating dictionary of dfs for every individual umpire compared with 
# all umpires for each condition we want to investigate
ump_dfsC = {}
for ump, v in ump_dfsG.items():
    ump_dfsC[ump] = {}
    for key, val in v.items():
        ump_dfsC[ump][key] = eda.compareDF(val, umpAll_dfsG[key])
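
One simple way to build such a comparison frame is to align the two grouped frames on their bins and subtract. The sketch below assumes a plain difference in called-strike rate; the actual compareDF may compute something else (for example a ratio or percentage), and compare_df_sketch is a hypothetical name.

```python
import pandas as pd

def compare_df_sketch(df, ref, on=("xBin", "zBin"), value="typeExp"):
    """Per-bin difference df - ref; positive means df calls more strikes in that bin."""
    merged = df.merge(ref, on=list(on), suffixes=("", "_ref"))
    merged[value] = merged[value] - merged[value + "_ref"]
    return merged.drop(columns=value + "_ref")

# toy grouped frames with string bin labels, not real data
ump_toy_g = pd.DataFrame({"xBin": ["a", "b"], "zBin": ["c", "c"], "typeExp": [0.9, 0.2]})
all_toy_g = pd.DataFrame({"xBin": ["a", "b"], "zBin": ["c", "c"], "typeExp": [0.8, 0.3]})
diff_toy = compare_df_sketch(ump_toy_g, all_toy_g)
```

Keeping the result in the same column layout as the inputs means the same plotting function can draw either a raw frame or a comparison frame.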

Final Visualization

Now we have all of our prepped dataframes within dictionaries. Using the eda module's zoneHexViz function, we can pass in different dataframes to visualize different scenarios and conditions. The function accepts an argument (diff, passed as diff2 in the cells below) that changes what a panel displays, so that it shows a difference between two dataframes instead of the raw data.
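
For orientation, a hexagonal-bin strike zone plot can be produced directly with matplotlib's hexbin. The sketch below is not zoneHexViz: the function name and gridSize argument suggest a hexbin underneath, but that is an assumption, and the pitch locations here are randomly generated rather than taken from pitches_expanded.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# synthetic pitch locations and a synthetic rectangular "zone", for illustration only
rng = np.random.default_rng(0)
plate_x = rng.uniform(-1.5, 1.5, 2000)
plate_z = rng.uniform(1.0, 4.0, 2000)
called_strike = ((np.abs(plate_x) < 0.83) & (plate_z > 1.5) & (plate_z < 3.5)).astype(float)

fig, ax = plt.subplots(figsize=(4, 5))
# each hexagon is colored by the mean of called_strike for the pitches inside it
hexes = ax.hexbin(plate_x, plate_z, C=called_strike, gridsize=15,
                  reduce_C_function=np.mean, cmap="Reds", vmin=0, vmax=1)
fig.colorbar(hexes, ax=ax, label="share of pitches called strikes")
ax.set_xlabel("plate_x (ft)")
ax.set_ylabel("plate_z (ft)")
ax.set_title("Synthetic strike zone")
hex_values = hexes.get_array()
```

A smaller gridsize gives larger hexagons, which smooths sparse subsets at the cost of spatial detail.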

Investigating Side of Plate Batter Hits From

The next two visuals look at the difference between left-handed and right-handed batters. As expected, the strike zones shift based on the handedness of the batter. The second visual includes a comparison plot which highlights the differences.

In [14]:
eda.zoneHexViz(umpAll_dfs['StandL'], umpAll_dfs['StandR'], 
               title1 = 'All Umpires - Left Handed Batters', 
               title2 = 'All Umpires - Right Handed Batters', gridSize=30)
In [15]:
dfC = eda.compareDF(umpAll_dfsG['StandR'], umpAll_dfsG['StandL'])
# title1 labels the first dataframe passed in, which is the right-handed batter data
eda.zoneHexViz(umpAll_dfsG['StandR'], dfC, title1 = 'All Umpires - Right Handed Batters', 
               title2 = 'All Umpires - Right Handed Batters Compared to Leftys', diff2 = True, gridSize=18)

Investigating Handedness of Pitcher

The next two visuals look at the difference between left-handed and right-handed pitchers. As expected, the strike zones shift based on the handedness of the pitcher, but not as significantly as they do with batter handedness. The second visual includes a comparison plot which highlights the differences.

In [16]:
eda.zoneHexViz(umpAll_dfs['ThrowL'], umpAll_dfs['ThrowR'], 
               title1 = 'All Umpires - Left Handed Pitchers', 
               title2 = 'All Umpires - Right Handed Pitchers', gridSize=30)
In [17]:
dfC = eda.compareDF(umpAll_dfsG['ThrowR'], umpAll_dfsG['ThrowL'])
# title1 labels the first dataframe passed in, which is the right-handed pitcher data
eda.zoneHexViz(umpAll_dfsG['ThrowR'], dfC, title1 = 'All Umpires - Right Handed Pitchers', 
               title2 = 'All Umpires - Right Handed Pitchers Compared to Leftys', diff2 = True, gridSize=18)

Investigating Beginning and End of Game

The next two visuals look at the difference between beginning and end of game. As expected the strike zones are very similar and do not show much change between beginning and end of game. The second visual includes a comparison plot which highlights the differences.

In [18]:
eda.zoneHexViz(umpAll_dfs['Begin'], umpAll_dfs['End'], 
               title1 = 'All Umpires - Beginning of Game (1st and 2nd innings)', 
               title2 = 'All Umpires - End of Game (8th inning and later)', gridSize=30)
In [19]:
dfC = eda.compareDF(umpAll_dfsG['End'], umpAll_dfsG['Begin'])
eda.zoneHexViz(umpAll_dfsG['End'], dfC, title1 = 'All Umpires - End of Game (8th inning and later)', 
               title2 = 'All Umpires - End of Game Compared to Beginning of Game', diff2 = True, gridSize=12)

Investigating Batter Heights

The next two visuals look at the difference between short (5'10" and shorter) and tall (6'2" and taller) players. As expected the strike zones shift based on the height of the batter. As player height increases the strike zone rises. The second visual includes a comparison plot which highlights the differences.

In [20]:
eda.zoneHexViz(umpAll_dfs['Short'], umpAll_dfs['Tall'], 
               title1 = 'All Umpires - Batters 5ft 10in and Shorter', 
               title2 = 'All Umpires - Batters 6ft 2in and Taller', gridSize=12)
In [21]:
dfC = eda.compareDF(umpAll_dfsG['Short'], umpAll_dfsG['Tall'])
eda.zoneHexViz(umpAll_dfsG['Short'], dfC, title1 = 'All Umpires - Batters 5ft 10in and Shorter', 
               title2 = 'All Umpires - Short Batters Compared to Tall', diff2 = True, gridSize=10)

Investigating Time of Game

The next two visuals look at the difference between day and night games. As expected, the strike zones are very similar, with only minor differences. The second visual includes a comparison plot which highlights the differences.

In [22]:
eda.zoneHexViz(umpAll_dfs['Day'], umpAll_dfs['Night'], 
               title1 = 'All Umpires - Day Games', 
               title2 = 'All Umpires - Night Games', gridSize=30)
In [23]:
dfC = eda.compareDF(umpAll_dfsG['Day'], umpAll_dfsG['Night'])
eda.zoneHexViz(umpAll_dfsG['Day'], dfC, title1 = 'All Umpires - Day Games', 
               title2 = 'All Umpires - Day Compared to Night Games', diff2 = True, gridSize=18)

Investigating Regular Season and Post Season Games

The next two visuals look at the difference between regular season and post season games. Overall the zones are very similar, apart from a few spots with higher values that shift the autoscaled axes. This shows that umpire strike zones don't change much between regular season and post season. The second visual includes a comparison plot which highlights the differences.

In [24]:
eda.zoneHexViz(umpAll_dfs['Reg'], umpAll_dfs['Post'], 
               title1 = 'All Umpires - Regular Season Games', 
               title2 = 'All Umpires - Post Season Games', gridSize=20)
In [25]:
dfC = eda.compareDF(umpAll_dfsG['Post'], umpAll_dfsG['Reg'])
eda.zoneHexViz(umpAll_dfsG['Post'], dfC, title1 = 'All Umpires - Post Season Games', 
               title2 = 'All Umpires - Post Season Compared to Regular Season Games', diff2 = True, gridSize=18)

Additional EDA

Now that we have explored the umpires as a whole through different game situations, we can begin looking at individual umpires. To do this we calculate percentage differences between individual umpires and the collective whole using the groupby dataframes. Binning all of the data creates 1080 data points per dataframe for every umpire in every scenario. This allows us to normalize the comparisons even though every umpire has a different number of pitches in each bin.

In [26]:
# creating new dictionary to house the umpires, keys, and percentage values
ump_dfsG_mean = {}
for ump, varDF in ump_dfsG.items():
    ump_dfsG_mean[ump] = {}
    for var in varDF.keys():
        ump_dfsG_mean[ump][var] = round(100*((ump_dfsG[ump][var]['typeExp'].mean()-
                                              umpAll_dfsG[var]['typeExp'].mean())/
                                    umpAll_dfsG[var]['typeExp'].mean()),3)
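
The calculation in the cell above is a relative difference of two means, 100 * (umpire mean - all-umpire mean) / all-umpire mean, rounded to three decimals. The sketch below repeats it on toy numbers; pct_diff_from_all is a hypothetical helper name and the values are invented.

```python
import pandas as pd

def pct_diff_from_all(ump_grouped, all_grouped, col="typeExp"):
    """Percent difference between an umpire's mean bin rate and the all-umpire mean bin rate."""
    ump_mean = ump_grouped[col].mean()   # pandas skips NaN (empty bins) by default
    all_mean = all_grouped[col].mean()
    return round(100 * (ump_mean - all_mean) / all_mean, 3)

# toy grouped frames: three bins each, one empty bin for the umpire
ump_toy = pd.DataFrame({"typeExp": [0.9, 0.1, None]})
all_toy = pd.DataFrame({"typeExp": [0.7, 0.1, 0.4]})
pct = pct_diff_from_all(ump_toy, all_toy)
print(pct)
```

In this toy case the umpire's mean over non-empty bins is 0.5 against 0.4 for all umpires, which comes out at 25 percent. Note that each bin counts equally regardless of how many pitches it holds, and that an umpire's empty bins drop out of his mean, so scenarios with little data (such as post season) can produce large swings.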
In [27]:
# now that we have a dictionary with all the information, we can convert it to a dataframe
# using dict comprehension and some other pandas functionality
dfUmp = pd.concat({k: pd.DataFrame.from_dict(v, 'index') for k, v in ump_dfsG_mean.items()}, axis=0)
dfUmp = dfUmp.unstack(level=-1)
dfUmp.columns = dfUmp.columns.get_level_values(1)
dfUmp.sort_values(by='Overall',ascending=False, inplace=True)
dfUmp.reset_index(inplace = True)
dfUmp.rename(columns={'index': 'umpname'}, errors='raise', inplace=True)
display(dfUmp)
display(dfUmp.describe())
umpname Overall StandL StandR ThrowL ThrowR Begin End Short Tall Day Night Close Open Reg Post
0 Bill Miller 9.418 9.432 13.300 12.057 9.658 11.142 10.862 21.362 8.541 10.121 9.548 10.080 10.208 9.399 29.752
1 Angel Hernandez 4.358 4.574 4.220 9.294 3.700 7.990 3.117 15.736 4.072 5.528 4.503 4.744 5.279 4.269 15.637
2 Eric Cooper 2.993 2.893 5.009 4.524 3.280 3.769 7.380 13.680 3.068 4.050 3.231 3.649 1.906 2.969 12.945
3 Jeff Nelson 2.694 7.392 1.170 2.679 3.231 6.377 3.468 14.198 2.915 2.710 2.855 2.487 6.602 2.320 29.286
4 Ted Barrett 2.005 1.946 1.613 0.830 3.317 4.203 3.239 18.863 2.276 3.076 1.953 1.566 6.409 1.941 15.223
5 Gary Cederstrom 1.240 3.128 0.021 4.538 0.979 5.131 2.870 22.255 1.081 1.808 1.420 1.576 3.298 1.285 6.260
6 Marvin Hudson 0.487 -0.259 2.298 3.033 0.065 3.261 3.011 10.276 0.870 0.219 1.144 0.430 4.103 0.586 -0.094
7 Chris Guccione -0.119 1.960 -2.382 0.526 0.671 -0.308 2.624 11.600 0.053 0.562 -0.700 -0.245 0.780 -0.184 19.997
8 Jim Reynolds -0.166 -0.919 0.325 2.324 0.034 -0.238 3.534 10.808 0.392 -1.680 0.922 -0.407 3.787 -0.236 8.648
9 Tim Timmons -0.316 -0.321 1.561 1.030 0.110 -0.915 0.672 20.802 -0.910 -0.956 0.256 -0.406 1.697 -0.168 1.744
10 Lance Barksdale -0.456 -1.511 4.901 3.784 -1.698 1.874 3.578 6.822 -0.478 1.195 -0.697 -0.221 2.698 -0.424 1.562
11 Jeff Kellogg -0.566 3.666 -2.002 0.918 -0.488 -0.175 1.569 19.222 -1.038 -0.601 -0.174 -0.652 1.553 -0.632 13.399
12 Adrian Johnson -1.537 1.946 -4.804 1.407 -1.866 -0.872 2.479 6.796 -1.264 -1.100 -1.103 -1.754 1.823 -1.511 NaN
13 Jerry Meals -2.285 -2.336 -2.703 2.665 -2.424 -0.823 0.240 14.447 -2.241 -1.142 -1.918 -1.983 -1.531 -2.331 8.530
14 Joe West -2.617 -2.194 -0.923 -1.919 -1.879 -1.391 -0.486 15.365 -2.557 0.119 -3.004 -2.463 0.444 -2.716 17.658
15 Mark Wegner -2.850 -0.615 -3.102 -0.619 -3.016 -1.625 -1.164 3.922 -2.892 -2.465 -2.846 -2.201 -2.018 -2.908 7.783
16 James Hoye -3.312 -0.407 -4.304 -2.069 -3.035 0.304 -1.820 7.566 -3.200 -3.102 -2.667 -2.553 -4.242 -3.283 18.306
17 Tom Hallion -3.487 -4.239 -3.498 -3.896 -2.829 0.713 0.463 8.811 -3.487 -0.229 -4.692 -3.795 -1.400 -3.549 6.797
Overall StandL StandR ThrowL ThrowR Begin End Short Tall Day Night Close Open Reg Post
count 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 17.000000
mean 0.304667 1.340889 0.594444 2.283667 0.433889 2.134278 2.535333 13.473944 0.288944 1.006278 0.446167 0.436222 2.299778 0.268167 12.554882
std 3.202790 3.491605 4.409504 3.838730 3.249662 3.651228 2.996986 5.546026 3.060699 3.212226 3.313576 3.298948 3.517079 3.188683 8.776326
min -3.487000 -4.239000 -4.804000 -3.896000 -3.035000 -1.625000 -1.820000 3.922000 -3.487000 -3.102000 -4.692000 -3.795000 -4.242000 -3.549000 -0.094000
25% -2.098000 -0.843000 -2.622750 0.602000 -1.875750 -0.694250 0.515250 9.177250 -1.996750 -1.064000 -1.714250 -1.925750 0.528000 -2.126000 6.797000
50% -0.241000 0.843500 0.173000 1.865500 0.049500 0.508500 2.747000 13.939000 -0.212500 0.169000 0.041000 -0.325500 1.864500 -0.210000 12.945000
75% 1.813750 3.069250 2.126750 3.596250 2.668000 4.094500 3.410750 18.081250 1.977250 2.484500 1.819750 1.573500 4.024000 1.777000 17.658000
max 9.418000 9.432000 13.300000 12.057000 9.658000 11.142000 10.862000 22.255000 8.541000 10.121000 9.548000 10.080000 10.208000 9.399000 29.752000
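
For a nested dictionary of the form {umpire: {scenario: value}}, pandas can also build the same kind of table in one step with DataFrame.from_dict and orient='index'. The sketch below is offered as an alternative to the concat/unstack approach, not as the notebook's method, and it uses invented names and values.

```python
import pandas as pd

# toy nested dict shaped like ump_dfsG_mean; names and numbers are made up
nested = {
    "Ump A": {"Overall": 1.5, "StandL": -0.2},
    "Ump B": {"Overall": 3.0, "StandL": 0.4},
}

df_ump_sketch = (
    pd.DataFrame.from_dict(nested, orient="index")   # rows = umpires, columns = scenarios
      .sort_values(by="Overall", ascending=False)
      .rename_axis("umpname")                        # name the index before moving it to a column
      .reset_index()
)
print(df_ump_sketch)
```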

dfUmp Analysis

Looking at the dataframe above, we can see which umpires are more or less likely to call strikes in different scenarios. These numbers tell us which umpire names and scenario keys to feed into our dictionaries (ump_dfsG and ump_dfsC) to see a corresponding visualization. As a demo we show Bill Miller (most aggressive overall strike zone) and Tom Hallion (most conservative overall strike zone). As expected, Bill Miller has much more red in his comparison plot than Hallion, who has more blue.

In [28]:
# Bill Miller - Most Aggressive Strike Calling Umpire
eda.zoneHexViz(ump_dfsG['Bill Miller']['Overall'], ump_dfsC['Bill Miller']['Overall'], 
               title1 = 'Bill Miller - Overall', title2 = 'Compared with All Umpires - Overall', diff2 = True, gridSize=15)
In [29]:
# Tom Hallion - Most Conservative Strike Calling Umpire
eda.zoneHexViz(ump_dfsG['Tom Hallion']['Overall'], ump_dfsC['Tom Hallion']['Overall'], 
               title1 = 'Tom Hallion - Overall', title2 = 'Compared with All Umpires - Overall', diff2 = True, gridSize=15)

Comparing Bill Miller and Tom Hallion

The visualizations below show the differences between the most aggressive and most conservative umpires in our filtered data. As the images show, there is a significant difference in the sizes and shapes of the strike zones of these two umpires. Using the built-in comparison visuals, we can easily see that difference quantified with respect to either umpire when compared to the other.

In [38]:
eda.zoneHexViz(ump_dfsG['Bill Miller']['Overall'], ump_dfsG['Tom Hallion']['Overall'], title1 = 'Bill Miller - Overall', 
               title2 = 'Tom Hallion - Overall', gridSize=14)
In [39]:
dfC = eda.compareDF(ump_dfsG['Bill Miller']['Overall'], ump_dfsG['Tom Hallion']['Overall'])
eda.zoneHexViz(ump_dfsG['Bill Miller']['Overall'], dfC, title1 = 'Bill Miller - Overall', 
               title2 = 'Compared to Tom Hallion', diff2 = True, gridSize=14)
In [40]:
dfC = eda.compareDF(ump_dfsG['Tom Hallion']['Overall'],ump_dfsG['Bill Miller']['Overall'])
eda.zoneHexViz(ump_dfsG['Tom Hallion']['Overall'], dfC, title1 = 'Tom Hallion - Overall', 
               title2 = 'Compared to Bill Miller', diff2 = True, gridSize=14)

Full Individual Umpire Report

We now have the ability to quickly visualize how an umpire compares to his peers. The cell below iterates through all scenarios for Joe West to give an idea of how he calls games.

In [30]:
# setting the viz up for Joe West
for key in ump_dfs['Joe West'].keys():
    eda.zoneHexViz(ump_dfs['Joe West'][key], ump_dfsC['Joe West'][key], title1 = 'Joe West - '+key, 
                   title2 = 'Compared to All Umpires - '+key, gridSize=10, diff2 = True)

Create umpire reports

Creating visualizations inside a notebook is great for exploring, but we also wanted the ability to create documents that could be distributed to anyone interested. The vizGenerator function built into our eda module allows that. One argument changes the report type from raw to comparison. The visualizations can also be suppressed or enabled, depending on whether you want a document or embedded visualizations. vizGenerator automatically iterates through the umpire keys in the dictionary passed in, along with the nested dictionaries, which creates a report for every umpire similar to the visuals shown above for Joe West.

In [31]:
### create a dictionary with all the keys and values that we want in the report below
reportViews = {}
reportViews['Overall'] = 'Overall Strike Zone'
reportViews['StandL'] = 'Strike Zone Left handed Batters'
reportViews['StandR'] = 'Strike Zone Right handed Batters'
reportViews['ThrowL'] = 'Strike Zone Left handed Pitchers'
reportViews['ThrowR'] = 'Strike Zone Right handed Pitchers'
reportViews['Begin'] = 'Strike Zone at the Beginning of the game (<=2 innings)'
reportViews['End'] = 'Strike Zone at the End of the game (>=8 innings)'
reportViews['Short'] = "Strike Zone for shorter batters (less than 5'10)"
reportViews['Tall'] = "Strike Zone for Taller batters (greater than 6'2)"
reportViews['Day'] = 'Strike Zone for games happening during day time'
reportViews['Night'] = 'Strike Zone for games happening during Night time'
reportViews['Close'] = 'Strike Zone for games that are tight (diff in score <3)'
reportViews['Open'] = 'Strike Zone for games where one team has lead >3'
reportViews['Reg'] = 'Strike Zone for regular season games'
#reportViews['Post'] = 'Strike Zone for post season games' leaving off post season
In [33]:
# passing in the dictionaries (ump_dfsG and ump_dfsC) to iterate through and generate the reports.
# these reports are word documents and a statement is printed to show successful completion.
# the report is housed in the /Report folder within this directory
eda.vizGenerator(ump_dfsG, ump_dfsC, vizSections= reportViews,comparison = True, generateReport=True)
Report is created for Chris Guccione in the Report folder
Report is created for James Hoye in the Report folder
Report is created for Marvin Hudson in the Report folder
Report is created for Tom Hallion in the Report folder
Report is created for Jerry Meals in the Report folder
Report is created for Adrian Johnson in the Report folder
Report is created for Joe West in the Report folder
Report is created for Jeff Nelson in the Report folder
Report is created for Eric Cooper in the Report folder
Report is created for Jeff Kellogg in the Report folder
Report is created for Jim Reynolds in the Report folder
Report is created for Lance Barksdale in the Report folder
Report is created for Mark Wegner in the Report folder
Report is created for Ted Barrett in the Report folder
Report is created for Angel Hernandez in the Report folder
Report is created for Tim Timmons in the Report folder
Report is created for Bill Miller in the Report folder
Report is created for Gary Cederstrom in the Report folder